Vinho Verde: Bringing Together Measurements and Reported Quality by Paul Haller

Univariate Plots Section

## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200  
##  Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   : 6.391  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900  
##  Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.00900   Min.   :  2.00      Min.   :  9.0       
##  1st Qu.:0.03600   1st Qu.: 23.00      1st Qu.:108.0       
##  Median :0.04300   Median : 34.00      Median :134.0       
##  Mean   :0.04577   Mean   : 35.31      Mean   :138.4       
##  3rd Qu.:0.05000   3rd Qu.: 46.00      3rd Qu.:167.0       
##  Max.   :0.34600   Max.   :289.00      Max.   :440.0       
##     density             pH          sulphates         alcohol     
##  Min.   :0.9871   Min.   :2.720   Min.   :0.2200   Min.   : 8.00  
##  1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50  
##  Median :0.9937   Median :3.180   Median :0.4700   Median :10.40  
##  Mean   :0.9940   Mean   :3.188   Mean   :0.4898   Mean   :10.51  
##  3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40  
##  Max.   :1.0390   Max.   :3.820   Max.   :1.0800   Max.   :14.20  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.878  
##  3rd Qu.:6.000  
##  Max.   :9.000

It looks like the analysis will centre on how quality, the only categorical variable, interacts with other variables (and combinations of them), as well as on interactions between variables that seem intuitively linked, e.g. free vs total sulfur dioxide or pH vs acidity levels. For most variables, the median and mean are quite close to each other, suggesting roughly symmetric, possibly normal, distributions. Let’s take a look at a selection of them:

Just to get an overview, I’ve plotted 10 possibly interesting variables as histograms, leaving out chlorides and density. I left these two out because I think the analysis will for the most part rest on examining the relationship between quality and some combination of other variables. From the description that came with the dataset, it seemed unlikely that either of them impacts taste much, so to simplify things, I left them out. All histograms apart from residual sugar and alcohol approximate a normal distribution, most skewing right slightly.

Quality ranges from 3 to 9, with the vast majority of wines being around a 5 or 6, i.e. squarely in the middle of the range (from 0-10).

Alcohol distribution looks fairly normal, skewed right and with perhaps a little bimodal bump at ca. 12.5?

Taking the log10 of alcohol makes the histogram appear a little more “normal”, but at this stage does not help me understand the alcohol variable better.

Residual sugar skews right very strongly, with the highest peak at about 1.

Taking the log10 of residual sugar and using a bin width of 0.05 gives us what looks like a bimodal distribution. So there is another cluster of values in the tail. Let’s look at an even smaller bin size and see what that shows us.

Now we can see the smaller “bump” between 0.75 and 1.25. This shows in a bit more detail what is happening in the tail of the histogram.
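The effect of the bin width on the log10 histogram can be reproduced along the following lines. This is a minimal base R sketch, not the code behind the plots in this report: the helper name `log10_hist` is hypothetical, the finer bin width of 0.01 is an assumed value, and the ten numbers are simply the first ten `residual.sugar` entries from the `str()` output, standing in for the full column.

```r
# First ten residual.sugar values from the str() output (g / dm^3);
# with the real data this would be the full residual.sugar column.
residual_sugar <- c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5)

# Hypothetical helper: histogram counts of log10(x) for a given bin width.
log10_hist <- function(x, binwidth) {
  lx <- log10(x)
  # Build break points that cover the whole range in steps of `binwidth`.
  lo <- floor(min(lx) / binwidth) * binwidth
  hi <- ceiling(max(lx) / binwidth) * binwidth
  breaks <- seq(lo, hi + binwidth, by = binwidth)
  hist(lx, breaks = breaks, plot = FALSE)  # counts only, no plot drawn
}

coarse <- log10_hist(residual_sugar, binwidth = 0.05)
fine   <- log10_hist(residual_sugar, binwidth = 0.01)  # assumed finer width

# Every observation lands in exactly one bin, whatever the bin width.
sum(coarse$counts)
sum(fine$counts)
```

Calling `plot(coarse)` or `plot(fine)` draws the corresponding histogram; with only ten values the second mode will of course not be visible.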

Subsetting for high quality wines (8 and 9) shows they also have higher alcohol levels (lower histogram).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.50   11.00   12.00   11.65   12.60   14.00

This summary confirms what the plots showed us: high quality wines tend to have high(er) alcohol levels. This is shown most clearly by the higher median and by the fact that the 1st Quartile is a lot higher for the higher quality wines.
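A comparison like the one above can be produced with `subset()` and `summary()`. The sketch below is only illustrative: the data frame `wines` and its eight rows are made up so that the example runs on its own, and only the column names `alcohol` and `quality` come from the dataset.

```r
# Made-up rows for illustration only; the real analysis uses the full
# wine data frame (4898 rows) with the same column names.
wines <- data.frame(
  alcohol = c(8.8, 9.5, 10.1, 11.0, 12.0, 12.6, 9.9, 13.1),
  quality = c(5L, 6L, 6L, 7L, 8L, 9L, 5L, 8L)
)

# Keep only the top-rated wines (quality 8 and 9).
high_quality <- subset(wines, quality >= 8)

# Quartiles, median and mean for all wines and for the top-rated subset.
summary(wines$alcohol)
summary(high_quality$alcohol)
```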

For citric acid, choosing a smaller bin width lets us see a weird spike at 0.5. Let’s zoom in on it.

Looks less “extreme” like this, but still noticeable. Is it just a fluke?

Taking the data subset for higher quality, we see a far narrower band of citric acid levels. Does a specific range of citric acid (not too little and not too much) lead to wines tasting better? Perhaps in combination with another variable.

Fixed acidity shows a nice normal distribution. There is no spike here corresponding to the one in the citric acid variable, as the bulk of the measurements are above 3, while almost all values for citric acid are below 1 g / dm^3.

Volatile acidity has a slightly longer right tail than fixed acidity. Is there a very slight bump at ca. 0.5? Let’s zoom in.

There doesn’t seem to be anything very interesting there. I also noticed that it probably doesn’t make sense to directly compare volatile to citric acidity, as there are simply far higher levels of volatile acidity compared to citric acidity.

What is the structure of your dataset?

There are 4898 observations of 13 variables. The first variable is just a counter of the samples, so it can be excluded for the most part. All the other variables are quantitative apart from “quality”, which is categorical (values from 0-10).

What is/are the main feature(s) of interest in your dataset?

I would say, how quality relates to the other variables. Which chemical properties of the wines correlate in which way with quality?

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

Several variables seem prima facie connected, e.g. free and total sulfur dioxide, as well as the acidity variables. (However, I might be wrong, especially concerning citric acid.)

Did you create any new variables from existing variables in the dataset?

No.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

Residual sugar was very strongly right-skewed, and alcohol had a possibly bimodal distribution. Citric acid has a big peak at ca. 0.5 g / dm^3. I subsetted the data to only include the best rated wines (by quality). I did this to get a first impression of what properties highly rated wines have. Preliminarily, I would say that higher rated wines seem to have higher levels of alcohol. The high quality wines also seem to have a smaller range of citric acid levels.

Bivariate Plots Section

The variable of interest, the one you’d want to predict, would be quality. But there don’t seem to be many strong (direct) correlations, apart from density vs. residual sugar.

From about quality 7 upwards, more wines have a higher alcohol content than a lower one. For wines of quality 5 and 6, the majority of samples have lower levels of alcohol (below 11 and below 10). This bears out what we saw in the matrix: higher quality wines have a higher alcohol content.

The same data as a boxplot: this shows at a glance how the typical (median) alcohol content of wine samples per quality rating increases with higher quality ratings.
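The numbers behind such a boxplot can be checked directly by computing the median alcohol content per quality rating. The following base R sketch uses made-up rows (the data frame `wines` is hypothetical, only the column names come from the dataset) and is not the plotting code used in this report.

```r
# Made-up rows for illustration only; column names match the dataset.
wines <- data.frame(
  alcohol = c(9.0, 9.4, 9.8, 10.2, 10.4, 11.0, 11.6, 12.4, 12.8),
  quality = c(5L, 5L, 5L, 6L, 6L, 6L, 7L, 8L, 8L)
)

# Median alcohol content for each quality rating.
median_by_quality <- tapply(wines$alcohol, wines$quality, median)
median_by_quality

# The same medians are the middle line of each box; plot = FALSE returns
# the box statistics without drawing (row 3 of $stats holds the medians).
bp <- boxplot(alcohol ~ quality, data = wines, plot = FALSE)
bp$stats[3, ]
```

Dropping `plot = FALSE` draws the boxplot itself.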

A general trend of lower residual sugar the more alcohol a given wine contains is visible, and it gets more pronounced the more alcohol a sample contains. There is also one extreme outlier with a very high amount of sugar, more than double the amount of the vast bulk of samples.

Would the line of best fit change significantly if I remove that one extreme outlier?

This scatterplot and line seem to show a stronger relationship between sugar and alcohol than the last chart did. By how much did the correlation actually change?

## 
##  Pearson's product-moment correlation
## 
## data:  df$alcohol and df$residual.sugar
## t = -35.321, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4726723 -0.4280267
## sample estimates:
##        cor 
## -0.4506312
## 
##  Pearson's product-moment correlation
## 
## data:  subset(df, df$residual.sugar < 60)$alcohol and subset(df, df$residual.sugar < 60)$residual.sugar
## t = -36.192, df = 4895, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4812780 -0.4370779
## sample estimates:
##        cor 
## -0.4594624

There was barely any change in the correlation. Probably this is due to the fact I only removed one value. But the resulting scatterplot shows the previously mentioned relationship a little more clearly, as it is “zoomed in” compared to the plot showing the extreme outlier.
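The before/after comparison can be wrapped in a small helper. This is a sketch under stated assumptions: `sugar_alcohol_cor` is a hypothetical name, and the ten rows are the first ten `alcohol` and `residual.sugar` values from the `str()` output, used only so the example runs on its own. None of these ten rows reaches the cut-off of 60, so both calls return the same result here; on the full data the cut-off drops exactly one sample, as the degrees of freedom in the output above show (4896 vs 4895).

```r
# First ten rows of alcohol and residual.sugar from the str() output.
wines <- data.frame(
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11),
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5)
)

# Hypothetical helper: Pearson correlation of alcohol and residual sugar,
# optionally restricted to samples below a residual sugar cut-off.
sugar_alcohol_cor <- function(data, max_sugar = Inf) {
  kept <- data[data$residual.sugar < max_sugar, ]
  test <- cor.test(kept$alcohol, kept$residual.sugar, method = "pearson")
  c(n = nrow(kept), cor = unname(test$estimate))
}

sugar_alcohol_cor(wines)                  # all samples
sugar_alcohol_cor(wines, max_sugar = 60)  # same cut-off as in the output above
```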

Comparing residual sugar and quality directly shows there is hardly any relationship between the two variables. I do find this surprising, as one would assume that the sweetness of a wine would affect the perceived quality in some way (negatively or positively). We can see though, thinking back to the histogram of residual sugar, that most samples have very low levels of residual sugar, i.e. close to 1 g / dm^3.

pH and fixed acidity: why do they only have a medium correlation? pH tells you how basic or acidic a substance / liquid is, so I would have expected a very high correlation, but it is only ca. 0.4. Why? Perhaps because the pH values of the samples are actually very close together?

Showing the pH scale from zero to 14 on the y-axis shows the narrow band of pH values relative to fixed acidity… wait. How is that even possible? It means that the fixed acidity measures something different from pH!

## 
##  Pearson's product-moment correlation
## 
## data:  no_acidity_outliers$pH and no_acidity_outliers$fixed.acidity
## t = -32.81, df = 4886, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4476053 -0.4016516
## sample estimates:
##        cor 
## -0.4249022

Removing the acidity outliers actually reduces the correlation ever so slightly.

So density is within a pretty narrow range, except for that one extreme outlier. It is at the same quality rating (six) as the outlier for residual sugar. As sugar and density have the highest correlation, this is to be expected; we will see this clearly in the next plot.
We can also see that higher quality wines tend to have a lower density, i.e. the correlation is negative.

This is a plot of the strongest direct relationship in the data. We can clearly see the diagonal shape the data points create, indicating a strong positive correlation.

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

Surprisingly, to me at least, there are no very strong direct relationships between quality and the quantitative variables. The variable with the strongest relationship to quality is alcohol, followed by density. Alcohol and density are clearly related (cor 0.4), due, I assume, to the chemical properties of liquids containing alcohol. I would then have expected the correlation between density and quality to be stronger.
As density correlates highly with residual sugar I will be looking at how density, residual sugar and alcohol relate to quality.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

Total sulfur dioxide and free sulfur dioxide correlate strongly, unsurprisingly. The fact that pH and acidity don’t have a stronger (negative) relationship surprises me a little, as pH is just a measure of how basic or acidic a substance is.

What was the strongest relationship you found?

The highest correlation is density to residual sugar at ca. 0.8.
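One way to find the strongest pairwise relationship is to scan the correlation matrix for the off-diagonal entry with the largest absolute value. The sketch below is a hedged illustration in base R: the helper `strongest_pair` is hypothetical, and the data are the first ten rows of four columns from the `str()` output (density is left out because `str()` only shows five of its values), so the pair it reports for this tiny sample says nothing about the full dataset.

```r
# First ten rows of four columns, taken from the str() output.
wines <- data.frame(
  fixed.acidity  = c(7, 6.3, 8.1, 7.2, 7.2, 8.1, 6.2, 7, 6.3, 8.1),
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5),
  pH             = c(3, 3.3, 3.26, 3.19, 3.19, 3.26, 3.18, 3, 3.3, 3.22),
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11)
)

# Hypothetical helper: the pair of columns with the largest |Pearson r|.
strongest_pair <- function(data) {
  m <- cor(data)
  # Blank out the diagonal and upper triangle so each pair is seen once.
  m[upper.tri(m, diag = TRUE)] <- NA
  idx <- which(abs(m) == max(abs(m), na.rm = TRUE), arr.ind = TRUE)[1, ]
  list(
    var1 = rownames(m)[idx[["row"]]],
    var2 = colnames(m)[idx[["col"]]],
    cor  = m[idx[["row"]], idx[["col"]]]
  )
}

strongest_pair(wines)
```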

Multivariate plots

This plot shows that the higher alcohol / lower sugar wines tend to be considered higher quality. However, there are some very high quality wines, between 11 and 13% alcohol, that have a fairly high sugar content (between 5 and 15 g / dm^3), whereas most of the good / OK wines with high alcohol content have little sugar, below 5 g / dm^3 I’d say. Let’s try swapping quality and alcohol around to see if that helps visually:

OK, so here we can see better that the wines with lower alcohol content (redder) tend to have more residual sugar. This can be seen most clearly by looking at higher quality wines (>=7): they tend to have higher alcohol contents, but those with high residual sugar tend to have lower levels of alcohol. As quality increases, residual sugar goes down, but as there are fewer samples of high quality wines, it’s harder to say how significant this pattern is. Remembering the histogram of residual sugar again, most values are around 1 g / dm^3; we can see that in the columns too, where most values (overplotted) are at the bottom, i.e. around 1.

Here we can see quite nicely the inverse relationship between density and alcohol, and to a lesser extent quality (The orange and yellow data points (quality 7 and 8) are clustered to the bottom left of the plot).

Here I have added an additional dimension (residual sugar) to the previous chart: larger plot points indicate higher levels of residual sugar. I am trying to visualise how residual sugar interacts with the other plotted variables. I hope I am not reading too much into this visualisation, but I would say that as density decreases and alcohol levels increase, the plot points also get smaller (except for the outliers). I would also provisionally say that there seems to be a sweet spot for quality at ca. 12.5% alcohol, with generally lower residual sugar.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

I think I was trying to show that alcohol, density and sugar all work together in explaining quality. But I am not so sure this is clearly borne out by what I have explored so far. If I were to build a model, I would be interested to see if residual sugar (in addition to density and alcohol) helps predict wine quality or not.
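The modelling question raised here could be approached by comparing two nested linear models. The sketch below is only one possible setup and not something done in this analysis: the helper name `compare_sugar_models` is hypothetical, and treating quality as a plain number in `lm()` is a simplifying assumption (an ordinal model might suit the ratings better).

```r
# Hypothetical helper: does adding residual.sugar improve a linear model of
# quality that already contains alcohol and density?
# Assumes `data` has numeric columns quality, alcohol, density and
# residual.sugar, and treats quality as a number (a simplification).
compare_sugar_models <- function(data) {
  base_model <- lm(quality ~ alcohol + density, data = data)
  full_model <- lm(quality ~ alcohol + density + residual.sugar, data = data)
  list(
    base_r2 = summary(base_model)$r.squared,
    full_r2 = summary(full_model)$r.squared,
    f_test  = anova(base_model, full_model)  # F-test for the added term
  )
}
```

Calling `compare_sugar_models()` on the full wine data frame would return the R-squared of both models and the F-test for the residual sugar term; a small p-value in `f_test` would suggest that residual sugar adds information beyond density and alcohol.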

Were there any interesting or surprising interactions between features?

Due to the clear relationships between sugar and density and density and alcohol, I was assuming that bringing these variables together would show far more clearly or strikingly the relationship between all 3 of these variables and ultimately, quality.

Final Plots and Summary

Plot One

Description One

On a logarithmic scale, we can see a slightly bimodal histogram. This shows that the values in the tail are mainly just below or just above 1, giving a bit more detail about the distribution of the values in the tail of this histogram.

Plot Two

Description Two

The plot points of these two variables line up very nicely, approximating a diagonal line from bottom left to top right. As there are far more samples around 1 g/dm^3 for residual sugar, the data “thins out” from left to right. A few extreme outliers are also visible, but the correlation of high residual sugar with high density holds for them too.

Plot Three

Description Three

This scatterplot shows the relationship of three quantitative variables to the single qualitative variable, i.e. quality. As density decreases and alcohol increases, quality tends to increase too. The relationship between sugar and quality is quite weak, but residual sugar is clearly linked to density and, to a lesser extent, to alcohol.


Reflection

Considering that these samples are all of a “natural” product, wine, I expected the distribution of most of the measured variables to be normal. Alcohol and residual sugar did not fit this picture very well, so I thought they might be interesting variables to explore. I also intuitively assumed that the sweetness and alcohol content of wine would reflect on its perceived quality. I was surprised that there were so few strong correlations; I expected the different chemical properties of the wines to be more closely linked, for some reason. The clearest one here was density and residual sugar. And since density and alcohol are correlated, I somehow also assumed that in combination they would show a clearer picture. I found this tricky to show with these scatterplots. As I mentioned in the analysis of the multivariate plots, I believe modelling these relationships might improve our understanding of them.